Geospatial Analysis of the 2023 Earthquakes in Turkey
Master Thesis
Master of Data Science
Gozde Yazganoglu (gozde.yazganoglu@cunef.edu)
1. Introduction ¶
In this notebook, as in the previous one, we use PyCaret to explore clustering models.
In the geographic notebooks we observed that some clustering exists, but we only looked at geographic locations. Here we can examine the dataset as a whole and see which other variables affect these clusters.
2. Importing libraries ¶
This notebook uses the new_pycaret.yaml environment; to run it locally, that environment must be installed first.
We rely on the basic pandas library to avoid conflicts with the environment. The data still retains geographic information such as latitude, longitude, spatial lags, and distances. If clusters exist, what are they, and where are they?
#importing libraries from new_pycaret environment
from pycaret.clustering import *
import pandas as pd
import pickle
import numpy as np
3. Reading the Data, Setup, and Modeling¶
Clustering in PyCaret:
- Setup:
Just like other modules in PyCaret, we begin with the setup function, where we preprocess and set up the data for clustering.
- Model Creation:
We can create a clustering model using the create_model function. For example, to create a K-Means clustering model:
- Model Visualization:
We can visualize cluster results using various plots, like the Elbow plot, Silhouette plot, etc.
- Assigning Labels:
Once we've chosen the best number of clusters, we can assign the data points to the respective clusters.
Advantages of Clustering in PyCaret:
Simplicity and Efficiency:
PyCaret's clustering module allows users to quickly set up and execute clustering algorithms with minimal code.
Integrated Visualization:
The library comes with built-in visualization tools that make it easy to analyze and interpret clustering results.
Variety of Algorithms:
PyCaret provides a range of clustering algorithms including K-Means, Agglomerative, DBSCAN, and more.
Preprocessing Included:
PyCaret's setup function handles many preprocessing tasks, such as scaling, automatically. This is essential for clustering since algorithms like K-Means are sensitive to feature scales.
Flexibility:
We can customize the preprocessing pipeline or use external models and tools if needed.
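The point about scaling is worth demonstrating. The sketch below (scikit-learn, which PyCaret wraps; synthetic data and seeds are arbitrary) shows a noisy large-scale feature drowning out the true grouping until the features are standardized:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two true groups separated on a small-scale feature, plus a
# large-scale feature that is pure noise.
labels = np.repeat([0, 1], 100)
small = labels.astype(float) + rng.normal(0, 0.05, 200)
large = rng.normal(0, 1000, 200)
X = np.column_stack([small, large])

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

print(adjusted_rand_score(labels, raw))     # near 0: the noise feature dominates
print(adjusted_rand_score(labels, scaled))  # near 1: the true grouping is recovered
```

Because PyCaret's `setup` normalizes features for us, this failure mode is handled automatically in the cells below.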
#reading pandas dataframe
data = pd.read_csv('../data/processed/df.csv')
data.columns
Index(['obj_type', 'info', 'damage_gra', 'locality', 'population', 'income',
'total_sales', 'second_sales', 'water_access', 'elec_cons',
'building_perm', 'land_permited', 'labour_fource', 'unemployment',
'agricultural', 'life_time', 'hb_per100000', 'fertility', 'hh_size',
'longitude', 'latitude', 'nearest_water_source_distance',
'nearest_camping_distance', 'nearest_earthquake_distance',
'nearest_fault_distance', 'elev', 'percentage', 'damaged_percentage',
'destroyed_percentage', 'spatial_lag', 'lag_percentage',
'std_percentage', 'std_lag_percentage', 'lag_damaged_percentage',
'std_damaged_percentage', 'std_lag_damaged_percentage',
'lag_destroyed_percentage', 'std_destroyed_percentage',
'std_lag_destroyed_percentage', 'lag_nearest_water_source_distance',
'std_nearest_water_source_distance',
'std_lag_nearest_water_source_distance', 'lag_nearest_camping_distance',
'std_nearest_camping_distance', 'std_lag_nearest_camping_distance',
'lag_nearest_earthquake_distance', 'std_nearest_earthquake_distance',
'std_lag_nearest_earthquake_distance', 'lag_nearest_fault_distance',
'std_nearest_fault_distance', 'std_lag_nearest_fault_distance',
'lag_damage_gra', 'std_damage_gra', 'std_lag_damage_gra'],
dtype='object')
We excluded the category "damage_gra = 0" as it denotes buildings with an undetermined status.
Additionally, we omitted 'percentage' and 'std_damage_gra' since they closely correlate with the damage_gra value.
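Such drops can be motivated by checking correlations against `damage_gra` directly. A minimal sketch with pandas, on a hypothetical stand-in frame (the real dataframe has many more columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for the notebook's dataframe: 'percentage' is,
# by construction, just a rescaled copy of 'damage_gra'.
df = pd.DataFrame({"damage_gra": rng.integers(1, 5, 500).astype(float)})
df["percentage"] = df["damage_gra"] * 25.0
df["income"] = rng.normal(5000, 800, 500)

# flag columns whose absolute correlation with the target exceeds 0.95
corr = df.corr(numeric_only=True)["damage_gra"].abs()
redundant = corr[(corr > 0.95) & (corr.index != "damage_gra")].index.tolist()
df = df.drop(columns=redundant)
print(redundant)  # ['percentage']
```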
# removing rows with damage grade 0 (buildings with an undetermined status)
data = data[data['damage_gra'] != 0]
data.groupby('damage_gra').count()
# removing 'percentage' and its derivatives, since they are directly
# correlated with the damage grade
data.drop(columns=['percentage', 'std_damage_gra', 'std_percentage',
                   'std_lag_damage_gra', 'lag_percentage'], inplace=True)
data.tail()
| obj_type | info | damage_gra | locality | population | income | total_sales | second_sales | water_access | elec_cons | ... | lag_nearest_camping_distance | std_nearest_camping_distance | std_lag_nearest_camping_distance | lag_nearest_earthquake_distance | std_nearest_earthquake_distance | std_lag_nearest_earthquake_distance | lag_nearest_fault_distance | std_nearest_fault_distance | std_lag_nearest_fault_distance | lag_damage_gra | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98790 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | 0.012445 | -0.364072 | -0.364657 | 0.092105 | -1.097597 | -1.097274 | 0.031475 | -1.551232 | -1.552713 | 1.0 |
| 98791 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | 0.012365 | -0.362477 | -0.364976 | 0.092183 | -1.098330 | -1.097127 | 0.031408 | -1.547625 | -1.553434 | 1.0 |
| 98792 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | 0.012396 | -0.363093 | -0.364853 | 0.092170 | -1.098204 | -1.097152 | 0.031503 | -1.552762 | -1.552407 | 1.0 |
| 98793 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | 0.012494 | -0.365055 | -0.364460 | 0.092054 | -1.097125 | -1.097368 | 0.031509 | -1.553062 | -1.552347 | 1.0 |
| 98794 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | 0.012537 | -0.365915 | -0.364288 | 0.092011 | -1.096724 | -1.097448 | 0.031543 | -1.554889 | -1.551981 | 1.0 |
5 rows × 49 columns
Unlike the classification experiments, the clustering module offers only the basic setup call: there is no target variable and no train/test split to configure.
K-Means and DBSCAN are popular clustering algorithms, but they operate on different principles and are suited to different types of data and use cases. Here's a comparison of the two:
K-Means:¶
Method:
Partitional clustering method. Iteratively assigns points to clusters by minimizing the sum of squared distances from points to their assigned cluster centers.
Number of Clusters:
Must be specified a priori. The choice of the number of clusters (k) can be guided by methods like the Elbow method, Silhouette score, etc.
Shape of Clusters:
Assumes that clusters are spherical and equally sized. Can struggle with non-spherical clusters or clusters of different densities.
Noise & Outliers:
Sensitive to noise and outliers, which can heavily influence the position of cluster centroids.
Initialization:
Depends on the initial placement of centroids. Common methods include random initialization and the K-Means++ initialization. Multiple runs with different initializations might be needed due to the possibility of convergence to local optima.
Scalability:
Relatively scalable, but can be computationally intensive for a very large number of data points. Variants such as MiniBatch K-Means can help in those cases.
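The elbow heuristic mentioned above can be sketched with scikit-learn: fit K-Means for a range of k and watch where the inertia stops dropping sharply. Synthetic blob data stands in for the real dataset here:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in: four well-separated blobs
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=42)

# inertia = within-cluster sum of squared distances; it always decreases
# with k, and the "elbow" is where the decrease levels off
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)}
for k, v in inertias.items():
    print(k, round(v, 1))
```

PyCaret's `plot_model(model, plot='elbow')`, used below, automates exactly this loop and marks the suggested k.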
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):¶
Method:
Density-based clustering method. Groups together points that are closely packed together in the data space, marking low-density regions as outliers.
Number of Clusters:
Does not require the number of clusters to be specified in advance. Automatically determines clusters based on data density.
Shape of Clusters:
Can find arbitrarily shaped clusters. Works well with clusters of similar density.
Noise & Outliers:
Explicitly handles noise and outliers by classifying them as points not belonging to any cluster.
Initialization:
Does not depend on initialization as K-Means does.
Parameters:
Requires specification of two main parameters: eps (defines the radius around a data point to look for neighbors) and min_samples (the minimum number of points needed to form a dense region). The choice of these parameters can significantly affect clustering results, and they might not be intuitive to set.
Scalability:
Less scalable for very large datasets as it requires distance computation between points. However, optimized versions and approximations exist to make it more scalable.
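These properties can be illustrated with scikit-learn's DBSCAN on the classic two-moons toy data, where the spherical assumption of K-Means fails but density-based clustering succeeds (the `eps` and `min_samples` values are hand-picked for this toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaved half-moons: non-spherical clusters that K-Means handles badly
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# DBSCAN labels noise points -1; the cluster count is inferred, not preset
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(n_clusters, n_noise)
```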
cluster = setup(data)
kmeans = create_model('kmeans')
| Description | Value | |
|---|---|---|
| 0 | Session id | 6993 |
| 1 | Original data shape | (98272, 49) |
| 2 | Transformed data shape | (98272, 120) |
| 3 | Numeric features | 46 |
| 4 | Categorical features | 3 |
| 5 | Preprocess | True |
| 6 | Imputation type | simple |
| 7 | Numeric imputation | mean |
| 8 | Categorical imputation | mode |
| 9 | Maximum one-hot encoding | -1 |
| 10 | Encoding method | None |
| 11 | CPU Jobs | -1 |
| 12 | Use GPU | False |
| 13 | Log Experiment | False |
| 14 | Experiment Name | cluster-default-name |
| 15 | USI | 8f7c |
| Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness | |
|---|---|---|---|---|---|---|
| 0 | 0.7315 | 645515079.2386 | 0.2921 | 0 | 0 | 0 |
list_kmeans = ['cluster', 'tsne', 'elbow', 'silhouette']
for plot in list_kmeans:
    plot_model(kmeans, plot=plot)
dbscan = create_model('dbscan')
| Silhouette | Calinski-Harabasz | Davies-Bouldin | Homogeneity | Rand Index | Completeness | |
|---|---|---|---|---|---|---|
| 0 | 0.4529 | 82.9497 | 1.4016 | 0 | 0 | 0 |
list_dbscan = ['cluster', 'tsne', 'distance', 'silhouette', 'distribution']
for plot in list_dbscan:
    plot_model(dbscan, plot=plot)
AttributeError: 'DBSCAN' object has no attribute 'cluster_centers_'
...
YellowbrickAttributeError: neither visualizer 'InterclusterDistance' nor wrapped estimator 'DBSCAN' have attribute 'cluster_centers_'
...
TypeError: Plot Type not supported for this model.
The first two plots render, but the 'distance' plot fails: it draws intercluster distances between centroids, and DBSCAN does not compute cluster centers, so PyCaret raises TypeError: Plot Type not supported for this model.
4. Model Selection:¶
K-Means shows superior Silhouette, Calinski-Harabasz, and Davies-Bouldin scores, suggesting more distinct and better-separated clusters than DBSCAN produces. Both algorithms score zero on homogeneity, Rand Index, and completeness; these are external metrics that require ground-truth labels, which we do not have here, and the buildings are in any case not spread homogeneously.
Given these results, K-Means is the better clustering model for defining distinct clusters, so we now profile what characterizes each of them.
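The internal metrics PyCaret reports can also be computed directly with scikit-learn, which is handy for comparing models outside PyCaret. A sketch on synthetic blobs (the data here is illustrative, not the thesis dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# higher Silhouette / Calinski-Harabasz and lower Davies-Bouldin indicate
# tighter, better-separated clusters
print("silhouette:", round(silhouette_score(X, labels), 3))
print("calinski-harabasz:", round(calinski_harabasz_score(X, labels), 1))
print("davies-bouldin:", round(davies_bouldin_score(X, labels), 3))
```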
# creating a dataframe with kmeans clusters
kmeans_df = assign_model(kmeans)
kmeans_df.tail()
| obj_type | info | damage_gra | locality | population | income | total_sales | second_sales | water_access | elec_cons | ... | std_nearest_camping_distance | std_lag_nearest_camping_distance | lag_nearest_earthquake_distance | std_nearest_earthquake_distance | std_lag_nearest_earthquake_distance | lag_nearest_fault_distance | std_nearest_fault_distance | std_lag_nearest_fault_distance | lag_damage_gra | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98790 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | -0.364072 | -0.364657 | 0.092105 | -1.097597 | -1.097274 | 0.031475 | -1.551232 | -1.552713 | 1.0 | Cluster 0 |
| 98791 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | -0.362477 | -0.364976 | 0.092183 | -1.098330 | -1.097127 | 0.031408 | -1.547625 | -1.553434 | 1.0 | Cluster 0 |
| 98792 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | -0.363093 | -0.364853 | 0.092170 | -1.098204 | -1.097152 | 0.031503 | -1.552762 | -1.552407 | 1.0 | Cluster 0 |
| 98793 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | -0.365055 | -0.364460 | 0.092054 | -1.097125 | -1.097368 | 0.031509 | -1.553062 | -1.552347 | 1.0 | Cluster 0 |
| 98794 | 212_RAILWAYS | 997_NOT_APPLICABLE | 1 | TURKOGLU | 78976 | 5997 | 1938 | 536 | 0.95 | 4343 | ... | -0.365915 | -0.364288 | 0.092011 | -1.096724 | -1.097448 | 0.031543 | -1.554889 | -1.551981 | 1.0 | Cluster 0 |
5 rows × 50 columns
kmeans_df.groupby('Cluster').describe()
| damage_gra | population | ... | std_lag_nearest_fault_distance | lag_damage_gra | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| Cluster | |||||||||||||||||||||
| Cluster 0 | 50971.0 | 1.091483 | 0.439657 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 50971.0 | 1.394911e+06 | ... | 1.025527 | 1.741478 | 50971.0 | 1.095019 | 0.333754 | 0.0 | 1.0 | 1.0 | 1.0 | 4.0 |
| Cluster 1 | 10694.0 | 1.384982 | 0.849468 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 10694.0 | 3.341033e+05 | ... | -1.268409 | -1.019502 | 10694.0 | 1.401590 | 0.717199 | 1.0 | 1.0 | 1.0 | 1.4 | 4.0 |
| Cluster 2 | 11941.0 | 1.000586 | 0.030347 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 11941.0 | 2.170110e+06 | ... | 1.016423 | 1.496670 | 11941.0 | 0.999933 | 0.025229 | 0.0 | 1.0 | 1.0 | 1.0 | 1.6 |
| Cluster 3 | 24666.0 | 1.061137 | 0.349672 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 24666.0 | 1.813991e+05 | ... | -0.385643 | -0.211827 | 24666.0 | 1.060148 | 0.231372 | 0.4 | 1.0 | 1.0 | 1.0 | 4.0 |
4 rows × 368 columns
5. Visualization and Prediction of the model:¶
Thanks to PyCaret, we were able to produce several diagnostic plots above with only a few lines of code. Now let us visualize some additional views of the clusters.
#tabular distribution for categorical variables.
print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['obj_type']))
print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['locality']))
print(pd.crosstab(kmeans_df['Cluster'], kmeans_df['info']))
The locality crosstab shows that the clusters are geographically exclusive: Cluster 2 consists entirely of SANLIURFA; Cluster 1 covers ANTAKYA and KIRIKHAN; Cluster 3 groups ADIYAMAN, BAHCE, DUZICI, OSMANIYE, ERDEMOGLU, and GOLBASI; and Cluster 0 holds the remaining localities (GAZIANTEP, KAHRAMANMARAS, MALATYA, ELBISTAN, AFSIN, DIYARBAKIR, ISLAHIYE, NURDAGI, PAZARCIK, and TURKOGLU). In the obj_type and info crosstabs, residential buildings and roads dominate every cluster, while railways, bridges, and pipeline and electricity lines appear almost exclusively in Clusters 0 and 3.
kmeans_df_map = kmeans_df[['latitude', 'longitude', 'Cluster']]
import matplotlib.pyplot as plt
# Create a scatter plot
plt.figure(figsize=(10, 8))
for cluster in kmeans_df_map['Cluster'].unique():
    subset = kmeans_df_map[kmeans_df_map['Cluster'] == cluster]
    plt.scatter(subset['longitude'], subset['latitude'], label=f'Cluster {cluster}')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Clusters based on Latitude and Longitude')
plt.legend()
plt.grid(True)
plt.show()
kmeans_damage_gra = kmeans_df[['damage_gra', 'Cluster']]
kmeans_damage_gra.groupby('damage_gra').describe()
| Cluster | ||||
|---|---|---|---|---|
| count | unique | top | freq | |
| damage_gra | ||||
| 1 | 92806 | 4 | Cluster 0 | 48446 |
| 2 | 2010 | 4 | Cluster 0 | 1057 |
| 3 | 2083 | 4 | Cluster 1 | 945 |
| 4 | 1373 | 3 | Cluster 0 | 670 |
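Raw counts can be misleading here because the clusters differ greatly in size; a row-normalized crosstab shows shares instead. A sketch with a hypothetical miniature of `kmeans_df` (only two columns and a handful of rows):

```python
import pandas as pd

# hypothetical miniature of `kmeans_df`: damage grade vs. assigned cluster
df = pd.DataFrame({
    "damage_gra": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "Cluster": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 0",
                "Cluster 1", "Cluster 1", "Cluster 1", "Cluster 0", "Cluster 0"],
})

# normalize="index" turns counts into row shares, so damage grades with
# very different frequencies can be compared on the same scale
shares = pd.crosstab(df["damage_gra"], df["Cluster"], normalize="index")
print(shares.round(2))
```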
# exporting the clustered dataframe for further visualization
kmeans_df.to_csv('../data/processed/kmeansdf.csv')